Infrastructure for a Multilingual Subtitle Generation System
Authors
Stelios Piperidis
Abstract
The expansion of digital television and the increasing demand to manipulate audiovisual content underlie the need for tools and systems that automate the multilingual subtitle generation process. In this setting, the MUSA project aims to provide a system that combines speech recognition, advanced text analysis, and machine translation to help generate multilingual subtitles. In its current version, the system treats English as both the source and target language for subtitle generation, and French and Greek as the target languages for subtitle translation. Training and evaluating the system components requires an array of application-specific resources. The primary audiovisual data consist of BBC TV documentaries. For each programme, the set of multifaceted multimedia parallel data captured includes the actual video; its transcript or script; English, Greek, and French subtitles; and topically relevant newspaper or web-sourced extracts.
Similar resources
Condensing Sentences for Subtitle Generation
Text condensation aims at shortening the length of an utterance without losing essential textual information. In this paper, we report on the implementation and preliminary evaluation of a sentence condensation tool for Greek using a manually constructed table of 450 lexical paraphrases, and a set of rules that delete syntactic subtrees that carry minor semantic information. Evaluation on two s...
Multimedia Content Processing and Retrieval in the REVEAL THIS Setting
The explosion of multimedia digital content and the development of technologies that go beyond traditional broadcast and TV have rendered access to such content important for all end-users of these technologies. REVEAL THIS develops content processing technology able to semantically index, categorise and cross-link multiplatform, multimedia and multilingual digital content, providing the syst...
Language Resources Production Models: the Case of the INTERA Multilingual Corpus and Terminology
This paper reports on the multilingual Language Resources (MLRs), i.e. parallel corpora and terminological lexicons for less widely digitally available languages, that have been developed in the INTERA project and the methodology adopted for their production. Special emphasis is given to the reality factors that have influenced the MLRs development approach and their final constitution. Buildin...
A Term Base Translator Over The Web
This paper is an attempt towards producing a utility to process documents over the Web in diversified formats in a multilingual mode, and produce semi-translated documents from Arabic to English and the other way around, by using a term based approach. The process incorporates morphological analysis techniques of Arabic to handle the canonical forms of vocabularies in the terms stored in the te...
Parallel Global Voices: a Collection of Multilingual Corpora with Citizen Media Stories
We present a new collection of multilingual corpora automatically created from the content available in the Global Voices websites, where volunteers have been posting and translating citizen media stories since 2004. We describe how we crawled and processed this content to generate parallel resources comprising 302.6K document pairs and 8.36M segment alignments in 756 language pairs. For some l...